1 Review of important goals

1.1 Necessary background:

  • 11% of pregnant US women reported past month cigarette use in 2018.

  • Prenatal maternal smoking is associated with negative health outcomes, including low birth weight, asthama and neurodevelopmental and behavioral issues in exposed children.

  • Prenatal maternal smoking is hard to measure: social desirability bias, serum cotinine is short-lived in pregnant women, and in adult cohorts, possibility of recall bias.

  • Finding a reliable and specific biomarker of prenatal maternal smoking is desirable.

1.1.1 Previous work

Previous work has found many associations between prenatal maternal smoking and DNA methylation, a type of epigenetic change to DNA where a methyl attaches to a CpG site.

2 Objectives and hypotheses

  1. Conduct an ancestry-stratified epigenome wide association study of prenatal maternal smoking and DNA methylation from child saliva samples at age nine and age fifteen.

  2. Examine differences in methylation trajectories from age nine to fifteen of the most significant methylation sites identified in the epigenome-wide association study between those exposed vs unexposed to prenatal smoking.

  3. Develop a saliva epigenetic classifier of prenatal smoke exposure using a machine learning approach and support vector machines and evaluate the agreement of the classifier at two time points.

Hypotheses: - I hypothesize that different CpG loci will be associated with maternal prenatal smoking in African ancestry groups than in European and Hispanic ancestry groups.

  • I hypothesize that the direction of associations between CpGs and prenatal maternal smoking will stay consistent between ages nine and fifteen but that the magnitude of associations will decline over time.

  • Finally, I hypothesize that classification of prenatal maternal smoking status based upon DNA methylation information from salivary samples will be accurate and consistent at ages nine and fifteen.

3 Reading in data and preprocessing

3.1 Labeling phenotype data

3.2 Cleaning pdqc data

pdqc_all<-readRDS(paste0(datadir, 'OGData/', "pd_qc.rds"))
pdqc<-pdqc_all %>% filter(probe_fail_10==0 & sex==predicted_sex)
#What FF ids have methylation data?
methy_ids<-pdqc$id
#create slide variable fo pdqc 
pdqc$slide <- gsub('_.*', '', pdqc$MethID)
pdqc$array<-gsub('.*_', '', pdqc$MethID)
#ids with 9visit, 15visit or both visit
visit9ids<-pdqc%>%filter(childteen=='C')%>%pull(idnum)%>%unique()
visit15ids<-pdqc%>%filter(childteen=='T')%>%pull(idnum)%>%unique()
bothvisit_ids<-visit9ids[visit9ids %in% visit15ids]
visit9only_ids<-visit9ids[!(visit9ids %in%bothvisit_ids)]
visit15only_ids<-visit15ids[!(visit15ids %in%bothvisit_ids)]
#create flag in FF for having methylation data
#and flag for having 2 visits of methylation data
FF_labeled<-FF_labeled %>% 
  mutate(MethylData=ifelse(id %in% methy_ids, "Analysis subset", "Not in analysis subset"),
         TwoVisitData=ifelse(id%in%bothvisit_ids,"Analysis subset - both visits", "Not in both visit analysis subset"))
#table(FF_labeled$MethylData)
#table(FF_labeled$TwoVisitData)
if((length(visit15only_ids)+length(visit9only_ids)+length(bothvisit_ids)!=length(FF_labeled %>% filter(MethylData=="Analysis subset")%>%pull(id)))){print("WARNING ID LENGTHS DO NOT ADD UP")}
if(length(bothvisit_ids)!=length(FF_labeled %>% filter(TwoVisitData=="Analysis subset - both visits")%>%pull(id))){print("WARNING ID LENGTHS DO NOT ADD UP")}

3.2.1 Filter technical replicates

I filtered to only 1 technical replicate per person-visit, picking the replicate with the lower probe fail percentage

#technical replicates will be duplicate id + childteen variables from pdqc
pdqc<-pdqc %>% mutate(newID=paste(idnum, childteen, sep='_'))
techrep_ids<- pdqc %>% select(newID) %>% group_by(newID)%>%filter(n()>1)
techreps<-pdqc %>% filter(newID %in% techrep_ids$newID)%>%arrange(idnum,childteen)
hiPctPF<-techreps %>% group_by(newID) %>%filter(probe_fail_pct==max(probe_fail_pct))%>%pull(MethID)
pdqc_clean<-pdqc %>% filter(!(MethID %in% hiPctPF) & childteen!='M')
myMethIDs<-pdqc_clean %>% pull(MethID)

3.2.2 Plate, slide and array data

batchData<-read.csv(file.path(datadir, 'Methylation_450K_array_batch_information.csv'))

4 Creating new variables

4.1 Creation of new sociodemographic and behavioral variables

4.1.1 Prenatal maternal smoking exposure variable: binary smoking variable

FF_labeled<-FF_labeled %>%mutate(smkPreg_binary=ifelse(m1g4%in% c("1 2+pk/d","2 1<pk<2","3 <1pk/d" ), "Yes", ifelse(m1g4%in%c("-2 Don't know", "-3 Missing", "-9 Not in wave"), NA, ifelse(m1g4== "4 None", "No", ifelse(m1g4=="Missing", "Missing", "FLAG")))))
createTable(compareGroups(m1g4~smkPreg_binary, data=FF_labeled))
## 
## --------Summary descriptives table by 'During the preg, how many cigarettes did you smoke?'---------
## 
## _______________________________________________________________________________ 
##                 1 2+pk/d   2 1<pk<2   3 <1pk/d    4 None     Missing  p.overall 
##                   N=15      N=103      N=834      N=3934      N=12              
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## smkPreg_binary:                                                           .     
##     Missing     0 (0.00%) 0 (0.00%)  0 (0.00%)   0 (0.00%)  12 (100%)           
##     No          0 (0.00%) 0 (0.00%)  0 (0.00%)  3934 (100%) 0 (0.00%)           
##     Yes         15 (100%) 103 (100%) 834 (100%)  0 (0.00%)  0 (0.00%)           
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

4.1.2 Ancestry variables

Ancestry is based on csv files of ancestry-specific PCS created by Dr. Erin Ware (per Jonah meeting Oct 6 2020). If an ID was in {african/european/hispanic}, then the individual is labeled as such.

# read in PCs
euroPC<-read.csv(paste0(datadir, "OGData/pcs/european.csv"))
afrPC<-read.csv(paste0(datadir, "OGData/pcs/african.csv"))
hisPC<-read.csv(paste0(datadir, "OGData/pcs/hispanic.csv"))
allPC<-read.table(paste0(datadir, "OGData/pcs/global_pcs.txt"), col.names=c("X", "idnum", paste("PC", 1:20, sep='')))
#individuals in each ancestry specific pc group can be labeled as that ancestry for categorical ancestry variable 
FF_labeled<-FF_labeled %>% mutate(ancestry=case_when(MethylData=="Analysis subset" & id %in% hisPC$idnum~"Hispanic ancestry", MethylData=="Analysis subset" & id %in%afrPC$idnum~"African ancestry", MethylData=="Analysis subset" & id %in% euroPC$idnum~"European ancestry", MethylData=="Analysis subset" & !(id %in%c(hisPC$idnum, afrPC$idnum, euroPC$idnum)) & (id %in% allPC$idnum)~"Other ancestry", MethylData=="Analysis subset" & !(id %in%c(hisPC$idnum, afrPC$idnum, euroPC$idnum)) & !(id %in% allPC$idnum)~"Missing PC data", MethylData!="Analysis subset"~"Not in analysis subset"))
#check ancestry variable
#table(FF_labeled$ancestry)
#with(FF_labeled, xtabs(~ancestry+cm1ethrace))
MissingPCdata<-FF_labeled %>% filter(ancestry=="Missing PC data")%>%select(id)
save(MissingPCdata, file = paste0(datadir, "CreatedData/MissingPC.Rdata"))
#add PC data to FF_labeled 
FF_labeled<-left_join(FF_labeled, allPC %>% mutate(id=as.character(idnum)) %>% select(id, PC1, PC2)%>%rename_at(vars(-id), function(x) paste0('global_', x, sep='')))
## Joining, by = "id"
localPC<-rbind(euroPC, afrPC, hisPC)
FF_labeled<-left_join(FF_labeled, localPC  %>% mutate(id=as.character(idnum)) %>% select(id, PC1, PC2)%>% rename_at(vars(-id), function(x) paste0('local_', x, sep='')))
## Joining, by = "id"

There were 47 individuals without PC data. None of these individuals had data in the global PC file, so I assume they are actually missing genetic data (for whatever reason, possibly QC).

4.1.3 Other prenatal substance exposure variables

Recoded frequency of prenatal alcohol drinking and drug use into Yes/no binary variables for drinking and drug use during pregnancy.

#table(FF_labeled$m1g2) # during preg, alchohol freq
#table(FF_labeled$m1g3) # during preg, drugs freq 
#with(FF_labeled, table(m1g2, m1g3))
FF_labeled<-FF_labeled %>% mutate_at(c("m1g2", "m1g3"), .funs=funs(YesNoPreg=ifelse(.=="5 Never", "No", ifelse(. =="Missing", "Missing", "Yes"))))
with(FF_labeled, table(m1g2, m1g2_YesNoPreg))
##           m1g2_YesNoPreg
## m1g2       Missing   No  Yes
##   1 E day        0    0    8
##   2 Svl/wk       0    0   27
##   3 Svl/mn       0    0   83
##   4 <1X/mn       0    0  403
##   5 Never        0 4364    0
##   Missing       13    0    0
with(FF_labeled, table(m1g3, m1g3_YesNoPreg))
##           m1g3_YesNoPreg
## m1g3       Missing   No  Yes
##   1 E day        0    0   27
##   2 Svl/wk       0    0   32
##   3 Svl/mn       0    0   61
##   4 <1X/mn       0    0  151
##   5 Never        0 4616    0
##   Missing       11    0    0
#table(FF_labeled$m1g5) # in last year, alc/drugs interfered with work/relationships
#table(FF_labeled$m1g6) # ever sought help for drug/alc problems

4.1.4 Postnatal smoke exposure

Maternal postnatal smoking at different survey points is fairly correlated, and alluvial plots of changes over time show broad consistency, so I decided to create a single smoking variable for maternal smoking at age 1 and age 5 - if mothers were smoking at either age 1 or age 5 then “Maternal smoking at age 1 or 5”, if not smoking at age 1 and age 5 then “No maternal smoking at age 1 and age 5” and then if missing data at one age and no at the other, then “Missing” (and if missing at both ages, “Missing”). Also created a similar dose variable capturing maternal packs per day that could be used instead. Decided to use just age 1 and age 5 because at these two time points the mother was asked (as opposed to PCG, also the dataset of Fragile Families metadata is missing PCG questions from age 3 about smoking) and they both preceded the actual saliva collection for both saliva collection age points. Then use a smoking dose variable from the mother or PCG interview at wave of saliva collection as an additional second hand smoke exposure variable.

#Baseline (1) -> Year 1/Age 1 (2) ->Year 3/Age3 (3) -> Year 5 (4) ->Year 9 (5) -> year 15 (6)
#Do you smoke in past month (how many packs per day)
#Mom            ->m2j5 (m2j5a)    ->                  -> m4j18 (m4j19) /p4 -> m5g17 (m5g18) / n5f17 (n5f18) 
#PCG                              p3a23a(p3a23b)                              n5f17 (n5f18)-> p6h76 (p6h77)
#seem to be missing p3 variables (PCG survey year 3) 
#number of people in household who smoke 
#p3a23->p5h15b->p6h78

#hours child around smoke 
#p3a21 -> p4a24->p5h15c

#with(FF_labeled, table(m2j5, m4j18))
#with(FF_labeled, table(m4j18, m5g17))
#with(FF_labeled, table(n5f17, m5g17))
#with(FF_labeled, table(p6h76, m5g17))


cor_plot_postnatalsmk<-plot_grid(FF_labeled %>% select(m2j5, m4j18, m5g17, n5f17, p6h74) %>% mutate_all(funs(as.numeric(.))) %>%ggcorr(label=T)+ggtitle("Correlation between Yes/No smoking (mother) \n at years 1, 5, & 9 & PCG at years 9 & 15")
,FF_labeled %>% select(m2j5a, m4j19, m5g18, n5f18, p6h77) %>% mutate_all(funs(as.numeric(.))) %>%ggcorr(label=T)+ggtitle("Correlation between packs per day smoking (mother) \n at years 1, 5, 9 & PCG at years 9 and 15"), nrow=2)

yesnosmkalluv<-to_lodes_form(as.data.frame(FF_labeled %>% select(m2j5, m4j18, m5g17, p6h74) %>% mutate(m5g17=stringr::str_to_title(m5g17))%>% group_by(m2j5, m4j18, m5g17, p6h74) %>% count()), axes=1:4, id='Cohort')

#ggplot(yesnosmkalluv, aes(x=x, stratum = stratum, alluvium=Cohort, fill=stratum, label=stratum, y=n))+geom_flow(stat="alluvium", lode.guidance="backfront", color='darkgrey')+geom_stratum()+scale_x_discrete(labels=c("Age 1", "Age 5", "Age 9"))

smkdose<-to_lodes_form(as.data.frame(FF_labeled %>% select(m2j5a, m4j19, m5g18, p6h77)%>% mutate_at(vars(c("m2j5a", "m4j19", "m5g18")), funs(factor(substring(., 1, 1), labels=levels(FF_labeled$m2j5a)))) %>% group_by(m2j5a, m4j19, m5g18, p6h77) %>% count()), axes=1:4, id='Cohort')
#ggplot(smkdose, aes(x=x, stratum = stratum, alluvium=Cohort, fill=stratum, label=stratum, y=n))+geom_flow()+geom_stratum()+scale_x_discrete(labels=c("Age 1", "Age 5", "Age 9"))

alluv_plot_postnatalsmk<-plot_grid(ggplot(yesnosmkalluv, aes(x=x, stratum = stratum, alluvium=Cohort, fill=stratum, label=stratum, y=n))+geom_flow()+geom_stratum()+scale_x_discrete(labels=c("Age 1", "Age 5", "Age 9", "Age 15"))
+ggtitle("Alluvial plot of Yes/No smoking variables from age 1 to age 9 (mother) and age 15 (PCG)"), ggplot(smkdose %>% filter(!(stratum %in% c("-6 Skip", "Missing"))), aes(x=x, stratum = stratum, alluvium=Cohort, fill=stratum, label=stratum, y=n))+geom_flow()+geom_stratum()+scale_x_discrete(labels=c("Age 1", "Age 5", "Age 9", "Age 15"))+ggtitle("Alluvial plot of smoking per day among those smoking from age 1 to age 9 (mother) and 15 (PCG)"), nrow=2)
## Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
#ggplot(FF_labeled, aes(x=p5h15b, y=p6h78))+geom_jitter()+stat_cor()+scale_x_continuous('# smoking in house age 9')+scale_y_continuous('# smoking in house age 15')+ggtitle("# individuals smoking in house correlation age 9 and age 15")
cor_plot_postnatalsmk

alluv_plot_postnatalsmk

#Make any maternal postnatal smoking variable & dose categorization for maternal smoking at ages 1 & 5 
FF_labeled<-FF_labeled %>% mutate(PostnatalMaternalSmokingAny = case_when(
                                m2j5 =="1 Yes" ~ "Maternal smoking at age 1 or age 5",
                                m4j18=="1 Yes" ~ "Maternal smoking at age 1 or age 5",
                                m2j5 == "Missing" & m4j18=="Missing" ~ "Missing",
                                m2j5 == "Missing" & m4j18=="2 No" ~ "Missing",
                                m2j5 == "2 No" & m4j18=="Missing" ~ "Missing",
                                m2j5 == "2 No" & m4j18=="2 No" ~ "No maternal smoking at age 1 and age 5")) %>%
  mutate(PostnatalMaternalSmokingDose=case_when(
                                m2j5a %in% c("2 1pk/d", "3 1.5pk/d", "4 2pk/d", "5 >2pk/d") &
                                m4j19 %in% c("2 About pack/day", "3 Pack and half/day", "4 2 packs/day", "5   
                                          More 2 packs/day") ~ "Mom - pack/d or greater at both age 1 & 5", 
                                m2j5a %in% c("2 1pk/d", "3 1.5pk/d", "4 2pk/d", "5 >2pk/d") &
                                !(m4j19 %in% c("2 About pack/day", "3 Pack and half/day", "4 2 packs/day", "5 
                                          More 2 packs/day")) ~ "Mom - pack/d or greater at either age 1 or 5",
                                !(m2j5a %in% c("2 1pk/d", "3 1.5pk/d", "4 2pk/d", "5 >2pk/d")) &
                                m4j19 %in% c("2 About pack/day", "3 Pack and half/day", "4 2 packs/day", "5 
                                          More 2 packs/day") ~ "Mom - pack/d or greater at either age 1 or 5",
                                m2j5a %in% c("1 <1/2 pk/d") | m4j19 %in% c("1 Less half pack/day")~"Mom - less than half pack/d at either age 1 or 5", 
                                m2j5a %in% c("-6 Skip") & m4j19 %in% c("-6 Skip")~
                                  "No maternal smoking at age 1 or 5", 
                                m2j5a %in% c("Missing") & m4j19 %in% c("-6 Skip", "Missing")~"Missing", 
                                m2j5a %in% c("Missing", "-6 Skip") & m4j19 %in% c("Missing")~"Missing"))

with(FF_labeled, table(PostnatalMaternalSmokingAny, m2j5, m4j18))
## , , m4j18 = 1 Yes
## 
##                                         m2j5
## PostnatalMaternalSmokingAny              1 Yes 2 No Missing
##   Maternal smoking at age 1 or age 5       868  305      81
##   Missing                                    0    0       0
##   No maternal smoking at age 1 and age 5     0    0       0
## 
## , , m4j18 = 2 No
## 
##                                         m2j5
## PostnatalMaternalSmokingAny              1 Yes 2 No Missing
##   Maternal smoking at age 1 or age 5       168    0       0
##   Missing                                    0    0     183
##   No maternal smoking at age 1 and age 5     0 2526       0
## 
## , , m4j18 = Missing
## 
##                                         m2j5
## PostnatalMaternalSmokingAny              1 Yes 2 No Missing
##   Maternal smoking at age 1 or age 5       123    0       0
##   Missing                                    0  371     273
##   No maternal smoking at age 1 and age 5     0    0       0
#with(FF_labeled, table(m2j5a, m4j19, PostnatalMaternalSmokingDose))
#with(FF_labeled, table(PostnatalMaternalSmokingAny, PostnatalMaternalSmokingDose))

4.2 Import and create methylation summary variables

4.2.1 Cell type proportions

Cell type proportions currently estimated using epidDISH package and the EpiFibIC reference (generic for epithelial tissues). This will be replaced with the saliva specific reference panel in Middleton et al..

betaqc<-readRDS(file=paste0(datadir, 'OGData/', "noob_filtered.rds"))

#cell type proportions 
celltypes<-epidish(beta.m = betaqc, ref.m = centEpiFibIC.m, method = "RPC")
estF_FF<-as.data.frame(celltypes$estF)
estF_FF$MethID = row.names(estF_FF)

In addition to the site-specific analysis proposed, Kelly and I decided to investigate certain DNA methylation summary metrics that are commonly used in the field. These include global methylation and methylation clocks (which are outcome-naive), and an ooutcome-predicated measure, polymethylation scores.

4.2.2 Global methylation

We calculated global methylation as the mean of the CpG site-specific methylation values.

aprioriCG<-c("cg05575921", "cg04180046", "cg05549655", "cg14179389")
globalmethy<-colMeans(betaqc, na.rm=T)
globalmethdf<-as.data.frame(cbind("MethID"=names(globalmethy), "globalmethylation"=globalmethy))
aprioriCGdf<-as.data.frame(t(betaqc[aprioriCG, ]))
aprioriCGdf$MethID = colnames(betaqc)

4.2.3 Methylation (or epigenetic) clocks

Methylation or epigenetic clocks are measures which exploit certain sets of CpG sites whose DNA methylation levels associate with subject age. These clocks are highly accurate molecular correlates of chronological age. However, variation in epigenetic clock measures among individuals of the same chronological age do exist and are assumed to reflect biological age.

Four different methylation clocks were created, based on different papers that used different tissue types and age groups for their training samples: Horvath13, Horvath18, Pediatric and Levine. Since this is a pediatric population, the pediatric clock is probably the best bet, but Jonah also suggested we look at the Levine clock.

clocks<-read.csv(file=paste0(datadir, "OGData/ffcw_n1776_8clocks.csv"), header = T)
colnames(clocks)[1]<-"MethID"

4.2.4 Polymethylation score

Here, the idea is similar to a polygenetic score. Essentially, we want to create a summary metric of CpG sites that have associated with an exposure or outcome of interest in a previous independent cohort. We identify a previous study which conducted a regression analysis of our exposure of interest on CpG sites and extract those regression coefficients. The regression coefficients become weights for the CpG site values in our cohort. We sum the weighted CpG site values across sites for each sample to come up with a single per-sample measure of exposure-related methylation.

\(Score_i=\sum^{m}_j Weight_j*CpG_{ij}\) where

\(i\) indexes samples in our study; \(j\) indexes CpG sites

\(Weight_j\) is the regression coefficient from the independent external study for that \(j\) CpG site

\(CpG_{ij}\) is the CpG methylation value for that \(j\) CpG in that \(i\) sample in our study

\(Score_i\) is the polymethylation score.

Because polymethylation scores are not widely used (yet), there isn’t a standardized approach. Importantly, much of the time researchers use the (confusingly named) ‘beta’ value to measure methylation at a CpG site, which is the ratio of the methylated probe intensity to the overall intensity and ranges from 0-1. We were concerned that this could lead to sites with more methylation (closer to 1) being upweighted in the resulting score), so we implemented versions of the score where the \(CpG_{ij}\) values were first mean-centered or z-score standardized before being used in the score.

For this analysis, we used a 2016 genome-wide consortium meta-analysisof prenatal maternal smoking and DNA methylation in blood samples, which used the same DNA methylation array. This study actually provided regression coefficients for all CpGs statistically significant after FDR correction (\(m_{CpG}=6073\)) in 4 different regressions:

  1. Regression with sustained prenatal maternal smoking as the exposure, methylation in newborn cord blood samples as outcomes
  2. Regression 1 adjusted for cell type confounding
  3. Regression of sustained prenatal maternal smoking as the exposure, methylation in peripheral blood samples of older children as outcomes
  4. Regression of any prenatal maternal smoking es the exposure, methylation in newborn cordblood as outcomes

These regression used normalized betas as the outcomes but found very little variation in results when using raw betas as the outcome (Spearman \(\rho\)=0.96 for regression coefficients, \(\rho\)=0.98 for \(log_{10}(pvalues\)))

#read in cpgs from PMC4833289
cpgs<-read.csv(paste0(datadir, 'PMC4833289_SuppFile2CpGs.csv'), stringsAsFactors = F, header=F, skip=2)
#make headers
header1<-scan(paste0(datadir, 'PMC4833289_SuppFile2CpGs.csv'), nlines=1, what=character(), sep=',')
header1[7:26]<-c(rep("SSnewborn", 5), rep("SSnewbornCT", 5), rep("SSolder", 5), rep("anynewborn", 5))
header2<-paste(header1, scan(paste0(datadir, 'PMC4833289_SuppFile2CpGs.csv'), nlines=1, skip=1, what=character(), sep=','), sep='_')
names(cpgs)<-header2

#four meta-EWASes: SS_newborn (sustained smoking in newborns); SS_newbornCT (sustained smoking in newborns controlling for cell type) SS_older (sustained smoking in older children); any_newborn (any smoking in newborns)
#pivot_longer to get single column of coefs and a column for the analysis type
#and group into a list by analysis type
cpgs<-cpgs %>% pivot_longer(cols = c(matches("_Coef|_SE|_P|_FDR|_Direction")), names_to = c("set", ".value"), names_pattern="(.+)_(.+)")%>%rename_with(~gsub('_', '', .x))%>%group_split(set)
cpgs<-cpgs%>%setNames(lapply(cpgs, function(x){x$set %>% unique()}))

#make scores - no transform, zscore and centermean
#first, pull script 
source(file.path(codedir, 'UsefulCode', 'makePolyEpiScores.R'))
scores=makePolyEpiScore(m=betaqc, b=cpgs)
## [1] "transforming the methylation matrix:none"
## [1] "transforming the methylation matrix:none"
## [1] "transforming the methylation matrix:none"
## [1] "transforming the methylation matrix:none"
scores_z=makePolyEpiScore(m=betaqc, b=cpgs, transformopt = 'zscore')
## [1] "transforming the methylation matrix:zscore"
## [1] "transforming the methylation matrix:zscore"
## [1] "transforming the methylation matrix:zscore"
## [1] "transforming the methylation matrix:zscore"
scores_center=makePolyEpiScore(m=betaqc, b=cpgs, transformopt = 'meancenter')
## [1] "transforming the methylation matrix:meancenter"
## [1] "transforming the methylation matrix:meancenter"
## [1] "transforming the methylation matrix:meancenter"
## [1] "transforming the methylation matrix:meancenter"
polyscores=list('notransform'=scores, 'zscore'=scores_z, 'center'=scores_center)
polyscores<-lapply(polyscores, function(x) x%>% mutate(MethID=colnames(betaqc)))
save(polyscores, file = file.path(datadir, 'CreatedData', 'polymethylationscores.RDS'))
polyscores_wide<-lapply(seq_along(polyscores), function(i) polyscores[[i]] %>% setNames(c(paste0(colnames(polyscores[[i]])[1:4], '_', names(polyscores)[i]), 'MethID')))%>%reduce(left_join, by='MethID')
clocks2<-clocks %>% mutate(idnum=gsub('a', '', idnum))%>%filter(MethID %in% pdqc$MethID)%>%select(MethID, idnum, childteen, horvath, pediatric, levine, grim) # currently clocks are missing 3 individuals who were in johns pipeline but not jonahs 

methyldata<-left_join(left_join(left_join(left_join(left_join(left_join(pdqc_clean %>% select(MethID, slide, array,childteen, idnum), globalmethdf), clocks2), estF_FF), aprioriCGdf), polyscores_wide), batchData)
## Joining, by = "MethID"
## Joining, by = c("MethID", "childteen", "idnum")
## Joining, by = "MethID"
## Joining, by = "MethID"
## Joining, by = "MethID"
## Joining, by = "MethID"
methyldata[, c("globalmethylation", aprioriCG)]<-lapply(methyldata[, c("globalmethylation", aprioriCG)],function(x) as.numeric(as.character(x)))
#lapply(methyldata, class)
methyldata_wide <- methyldata %>% select(-MethID)%>% pivot_wider(., names_from=childteen, values_from=any_of(colnames(methyldata)[4:length(colnames(methyldata))]))
#weirdID<-methyldata %>% filter(is.na(globalmethylation))%>%pull(MethID)
#globalmethy[weirdID] #not actually missing global methylation data
#clocks %>% filter(MethID%in%weirdID) # #idnum from  not the same as in
#pdqc_clean %>% filter(MethID==weirdID) #this is just a typo in the idnum from data entry, see format of other #idnums
#pdqc_clean %>% pull(idnum) %>% head()
#solution: when creating methyldata, first trim alphabet from clocks$idnum that contain alphabetic characters --> implemented
#also misssing clocks for 3 sex outlier samples that weren't in jonahs' pipeline: 
table(is.na(methyldata$horvath13))
weirdIDs<-methyldata %>% filter(is.na(horvath13))%>%pull(MethID)
#pdqc_all %>% filter(MethID %in% weirdID) %>% xtabs(~predicted_sex_outlier, data=.)

4.3 Distributions & checks of methylation variables

4.3.1 Add methylation data to metadata

varLabels2<-c(paste(colnames(FF_labeled)[1:66], varLabels), "Has any methylation data?", "Has two visits of methylation data", "Any maternal smoking during pregnancy", "Ancestry from PCA of child's genetic data", "PC 1 (global)", "PC 2 (global)", "PC 1 (within ancestry category)", "PC 2 (within ancestry category)",
"Any alcohol during pregnancy", "Any drugs during pregnancy", "Any postnatal maternal smoking (age 1 & 5)", "Postnatal maternal smoking dose at age 1 & 5")
FF_labeled<-set_label(FF_labeled, varLabels2)
  
myFF<-FF_labeled %>% filter(MethylData=="Analysis subset")%>%mutate(idnum=id)
myFF<-left_join(myFF, methyldata)
## Joining, by = "idnum"
varLabels3<-c(varLabels2, "ID", "Methylation ID", "Slide", "Array (RC)", "Visit", "Global methylation", "horvath methylation clock", "pediatric methylation clock", "levine methylation clock", "GRIM methylation clock", "Epithelial cell proportion", "Fibroid cell proportion", "Immune cell proportion", "cg05575291 methylation", "cg04180046 methylation",  "cg05549655 methylation", "cg14179389 methylation", "PES: Any smoking, newborn cordblood", "PES: Sustained smoking, newborn cordblood", "PES: Sustained smoking, newborn cordblood (+CT control)", "PES: Sustained smoking, older children peripheral blood", "PES: Any smoking, newborn cordblood - zscore", "PES: Sustained smoking, newborn cordblood - zscore", "PES: Sustained smoking, newborn cordblood (+CT control) - zscore", "PES: Sustained smoking, older children peripheral blood - zscore", "PES: Any smoking, newborn cordblood - mean center", "PES: Sustained smoking, newborn cordblood - mean center", "PES: Sustained smoking, newborn cordblood (+CT control) - mean center", "PES: Sustained smoking, older children peripheral blood - mean center", 'Plate', 'Slide', 'Array')
myFF<-set_label(myFF, varLabels3)

4.4 Correlation and distribution of global methylation and epigenetic clocks

chart.Correlation(myFF %>%select(globalmethylation, levine, pediatric, grim))

4.4.1 Correlation and distribution of polymethylation scores

chart.Correlation(myFF %>%select(SSnewborn_notransform, SSnewbornCT_notransform, SSolder_notransform, anynewborn_notransform))

postscript(file.path(outputdir, 'CorrelationPMSWeights.ps'))
chart.Correlation(myFF %>%select(SSnewborn_notransform, SSnewbornCT_notransform, SSolder_notransform, anynewborn_notransform))
dev.off()
## png 
##   2

4.4.2 Correlation and distribution of polymethylation score coefficients

coefs_methyl<-cbind('SSolderCoefs'=cpgs$SSolder$Coef, 'SSnewbornCTCoefs'=cpgs$SSnewbornCT$Coef, 'SSnewbornCoefs'=cpgs$SSnewborn$Coef, 'anynewbornCoefs'=cpgs$anynewborn$Coef)
chart.Correlation(coefs_methyl)

4.4.3 Distributions of poly methylation scores with different transformations on beta methylation matrix

#lapply(polyscores, function(x){summary(x)})
#lapply(polyscores, function(x){apply(x, 2, sd)})
dist_plots=lapply(seq_along(polyscores), function(i){polyscores[[i]]%>%pivot_longer(cols=1:4, names_to='Score_Type', values_to='Score')%>%ggplot(aes(x=Score))+geom_histogram()+facet_wrap(~Score_Type)+ggtitle(names(polyscores)[[i]])+theme_classic()})
plot_grid(dist_plots[[1]], dist_plots[[2]], dist_plots[[3]], ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

transform_labels<-c('Mean center', 'No transformation', 'Z-score standardization')
names(transform_labels)<-c('center', 'notransform', 'zscore')
postscript(file.path(outputdir, 'DistributionPMStransformation.ps'))
polyscores %>% bind_rows(.id='transform')%>%ggplot(aes(x=SSnewbornCT))+geom_histogram()+facet_wrap(~transform, scales='free', labeller=labeller(transform=transform_labels))+theme_classic()+theme(text=element_text(size=16))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
dev.off()
## png 
##   2

As Kelly thought, the zscore standardized have a larger range and the biggest standard deviations, the standard deviations of the no transform and centered methods are the same, which makes sense as \(V(aX+b)=a^2V(X)\implies V(\sum\beta_i*[x_i-\bar{x}])=V(\sum\beta_i*x_i-\beta_i*\bar{x})=V(\sum \beta_i*x_i-k)=V(\sum \beta_i*x_i)\).

4.4.4 Correlation between methylation data

methylation_data<-myFF %>% select(globalmethylation, pediatric, levine, Epi, Fib, IC, anynewborn_center, SSnewbornCT_center, SSolder_center)
chart.Correlation(methylation_data, histogram=T)

4.4.5 Methylation variables over time or by visit

#Cell type
cellbytime<-myFF %>% pivot_longer(cols=c(Fib, Epi, IC), names_to="Celltype", values_to = "Proportion")

celltype_labels<-c('Epithelial cells', 'Fibroblasts', 'Immune cells'); names(celltype_labels)<-c('Epi', 'Fib', 'IC')

ggplot(cellbytime, aes(x=childteen, y=Proportion, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+facet_wrap(~Celltype, labeller=labeller(Celltype=celltype_labels))+stat_compare_means(label="..p.signif..")+ggtitle("Cell type proportions by visit")+theme_classic()+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()

cairo_ps(file.path(outputdir, 'SpaghettiCellTypes.ps'))
ggplot(cellbytime, aes(x=childteen, y=Proportion, group=idnum))+geom_point()+geom_line(alpha=0.1, color='grey')+geom_smooth(aes(group=1), method='lm', fill='blue')+facet_wrap(~Celltype, labeller=labeller(Celltype=celltype_labels))+theme_classic()+scale_x_discrete('', labels=c('Age 9', 'Age 15'))+ylab('Proportion')+theme(text=element_text(size=16))
## `geom_smooth()` using formula 'y ~ x'
dev.off()
## png 
##   2
global<-ggplot(myFF, aes(x=childteen, y=globalmethylation, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Global methylation")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
pediatric<-ggplot(myFF, aes(x=childteen, y=pediatric, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Pediatric methylation clock")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
levine<-ggplot(myFF, aes(x=childteen, y=levine, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Levine methylation clock")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
plot_grid(global, pediatric, levine, nrow=1)
## Warning: Removed 3 rows containing non-finite values (stat_ydensity).
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing non-finite values (stat_compare_means).
## Warning: Removed 3 rows containing non-finite values (stat_ydensity).
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing non-finite values (stat_compare_means).

#ggplot(myFF, aes(x=childteen, y=cg05575921, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Methylation at cg05575921 by visit")
any<-ggplot(myFF, aes(x=childteen, y=anynewborn_center, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Any smoking, \n newborn polymethylation score")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
SSnewborn<-ggplot(myFF, aes(x=childteen, y=SSnewbornCT_center, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Sustained smoking, \n newborn CT polymethylation score ")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
SSolder<-ggplot(myFF, aes(x=childteen, y=SSolder_center, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Sustained smoking, \n older children polymethylation score")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
plot_grid(any, SSnewborn,SSolder, ncol=3)

Cell type proportions do differ between 9 and 15 year visits and although global methylation does not look different between these visits, methylation clocks do (a good check).

Possibly higher polymethylation scores at teen visit, although inconsistent across different score types.

4.4.6 Correlation betwen polymethylation scores over time

PMS<-methyldata %>%select(idnum, childteen, colnames(polyscores_wide), -MethID)%>%pivot_longer(cols=any_of(colnames(polyscores_wide)), names_to="PMS_type", values_to="PolymethylationScore")%>%pivot_wider(names_from = childteen, values_from=PolymethylationScore)
PMSType_label<-c('Any smoking;\n newborn', 'Sustained smoking;\n newborn', 'Sustained smoking + cell type;\n newborn', 'Sustained smoking;\n older')
names(PMSType_label)<-c('anynewborn_center', 'SSnewborn_center', 'SSnewbornCT_center', 'SSolder_center')
ggplot(PMS%>% filter(str_detect(PMS_type, '_center')), aes(x=C, y=T))+geom_point()+facet_grid(~PMS_type, labeller=labeller(PMS_type=PMSType_label))+stat_cor()+geom_smooth(method=lm)+theme_classic()+scale_x_continuous('Age 9 score')+scale_y_continuous('Age 15 score')
## Warning: Removed 300 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 300 rows containing non-finite values (stat_smooth).
## Warning: Removed 300 rows containing missing values (geom_point).

postscript(file.path(outputdir, 'ScoreCorrelationVisits.ps'))
ggplot(PMS%>% filter(str_detect(PMS_type, '_center')), aes(x=C, y=T))+geom_point()+facet_grid(~PMS_type, labeller=labeller(PMS_type=PMSType_label))+stat_cor()+geom_smooth(method=lm)+theme_classic()+scale_x_continuous('Age 9 score')+scale_y_continuous('Age 15 score')+theme(text=element_text(size=16))
## Warning: Removed 300 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 300 rows containing non-finite values (stat_smooth).
## Warning: Removed 300 rows containing missing values (geom_point).
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
dev.off()
## png 
##   2

4.4.7 Spaghetti plots of polymethylation scores over time

methyldata %>% pivot_longer(cols=contains('_center'), names_to='measure', values_to='Score')%>%ggplot(aes(x=childteen, y=Score, group=idnum))+geom_point()+geom_line(alpha=0.1, color='grey')+geom_smooth(aes(group=1), method='lm', fill='blue')+facet_wrap(~measure, labeller=labeller(measure=PMSType_label))+theme_classic()+scale_x_discrete('', labels=c('Age 9', 'Age 15'))+ylab('Score')
## `geom_smooth()` using formula 'y ~ x'

cairo_ps(file.path(outputdir, 'SpaghettiScores.ps'))
methyldata %>% pivot_longer(cols=contains('_center'), names_to='measure', values_to='Score')%>%ggplot(aes(x=childteen, y=Score, group=idnum))+geom_point()+geom_line(alpha=0.1, color='grey')+geom_smooth(aes(group=1), method='lm', fill='blue')+facet_wrap(~measure, labeller=labeller(measure=PMSType_label))+theme_classic()+scale_x_discrete('', labels=c('Age 9', 'Age 15'))+ylab('Score')+theme(text=element_text(size=16))
## `geom_smooth()` using formula 'y ~ x'
dev.off()
## png 
##   2

4.5 Make visit specific variables

4.5.1 Child’s age at time of interview/visit

myFF %>% filter(is.na(cp6yagem) &is.na(hv6_yagem) & childteen=='T')%>%dim() #3
## [1]   3 113
myFF %>% filter(is.na(cm5b_age) &is.na(hv5_agem) &childteen=='C')%>%dim() #1 
## [1]   0 113
myFF %>% filter(!is.na(cm5b_age) & is.na(hv5_agem) & childteen=='C')%>%dim()#1
## [1]   1 113
myFF %>% filter(!is.na(cp6yagem) &is.na(hv6_yagem))%>%dim() #990
## [1] 976 113
myFF <- myFF %>% 
  mutate(ChildAgeAtIn=case_when(childteen=='C'~ cm5b_age, childteen=='T'~cp6yagem))%>%
  mutate(ChildAgeAtVisit=case_when(childteen=='C'~hv5_agem, childteen=='T'~hv6_yagem))%>% 
  mutate(ChildAgeComposite=case_when(childteen=='C' & !is.na(hv5_agem)~hv5_agem, 
         childteen=='C' & is.na(hv5_agem)~cm5b_age, 
         childteen=='T' & !is.na(hv6_yagem)~hv6_yagem, 
         childteen=='T' & is.na(hv6_yagem)~cp6yagem))

myFF %>% filter(childteen=='C')%>%group_by(ChildAgeAtVisit, ChildAgeComposite)%>%summarise(n=n())%>%filter(ChildAgeAtVisit!=ChildAgeComposite)
## `summarise()` has grouped output by 'ChildAgeAtVisit'. You can override using the `.groups` argument.
## # A tibble: 0 × 3
## # Groups:   ChildAgeAtVisit [0]
## # … with 3 variables: ChildAgeAtVisit <dbl>, ChildAgeComposite <dbl>, n <int>
myFF %>% filter(childteen=='T')%>%group_by(ChildAgeAtVisit, ChildAgeComposite)%>%summarise(n=n())%>%filter(ChildAgeAtVisit!=ChildAgeComposite)
## `summarise()` has grouped output by 'ChildAgeAtVisit'. You can override using the `.groups` argument.
## # A tibble: 0 × 3
## # Groups:   ChildAgeAtVisit [0]
## # … with 3 variables: ChildAgeAtVisit <dbl>, ChildAgeComposite <dbl>, n <int>
ggplot(myFF%>%filter(childteen=='C'), aes(cm5b_age, hv5_agem))+geom_point()+stat_cor()+ggtitle("Correlation between childs age at mom interview (9 yr) and childs age at home visit (9 yr)")
## Warning: Removed 23 rows containing non-finite values (stat_cor).
## Warning: Removed 23 rows containing missing values (geom_point).

age5resid<-myFF %>% filter(childteen=='C')%>%mutate(age5_resid=cm5b_age-hv5_agem)%>%pull(age5_resid)
hist(age5resid)

age6resid<-myFF %>% filter(childteen=='T')%>%mutate(age6_resid=cp6yagem-hv6_yagem)%>%pull(age6_resid)
ggplot(myFF%>%filter(childteen=='T'), aes(cp6yagem,hv6_yagem))+geom_point()+stat_cor()+ggtitle("Correlation between childs age at mom interview (15 yr) and childs age at home visit (15 yr)")
## Warning: Removed 498 rows containing non-finite values (stat_cor).
## Warning: Removed 498 rows containing missing values (geom_point).

hist(age6resid)

table(FF_labeled$cp6yagem-FF_labeled$hv6_yagem)
## 
##    0    5 
## 1088    1
table(FF_labeled$cm5b_age-FF_labeled$hv5_agem)
## 
##  -21  -18  -15  -14  -13  -12  -11  -10   -9   -8   -7   -6   -5   -4   -3   -2 
##    1    1    2    4    3    6   15   10    5   10   13   16   13   24   52   70 
##   -1    0    1    2    3    4    5    6    7    8   10   11   12   13   14   15 
##  123  189 1424 1222   18   12    5   15    4    4    3    2    1    1    1    3 
##   16   17   21 
##    1    1    1
table(is.na(myFF$ChildAgeComposite)) # 4
## 
## FALSE  TRUE 
##  1682     3

4.5.2 Maternal or PCG smoking dose at time of sample

myFF<-myFF %>% mutate(SmkAtVisitPastmonth=case_when(childteen=='C' & m5g17=='1 yes' & m5g18 %in% c("2 about a pack", "3 a pack and a half", "4 about 2 packs", 
                              "5 more than two  packs") ~ "Pack or more a day", 
                              childteen=='C' & m5g17=='1 yes' & m5g18=="1 less than half a pack a day" ~ "Less than pack a day", 
                              childteen=='C' & n5f17=="1 yes" & n5f18 =='2 about a pack' ~ "Pack or more a day",
                              childteen=='C' & n5f17=="1 yes" & n5f18 =="1 less than half a pack day" ~ "Less than pack a day",
                              childteen=='C' & m5g17=="1 yes" & m5g18 %in% c("-6 Skip", "Missing")~"Missing",
                              childteen == 'C' & n5f17%in%c("-7 N/A", "2 no") & m5g17=="2 no" ~ "No smoking", 
                              childteen == 'C' & n5f17%in%c("2 no") & m5g17%in%c("2 no", "Missing") ~ "No smoking", 
                              childteen == 'C' & n5f17 %in% c("Missing", "-7 N/A") & m5g17 =="Missing" ~ "Missing", 
                              childteen=='T' & p6h76 %in% c("-6 Skip", "0 None") ~ "No smoking", 
                              childteen=='T' & p6h77%in% c("1 5 or fewer cigarettes per day", "2 About half a pack a day, 10 cigarettes")~"Less than pack a day", 
                              childteen=='T' & p6h77 %in% c("3 About a pack a day, 20 cigarettes", "4 More than 1 pack a day")~"Pack or more a day",
                              childteen=='T' & p6h77 == "Missing"~"Missing"))%>%
              mutate(SmkAtVisitYesNo=case_when(SmkAtVisitPastmonth=="No smoking"~"No", 
                                               SmkAtVisitPastmonth %in% c("Less than pack a day", "Pack or more a day") ~ "Yes", 
                                               childteen=='T'& SmkAtVisitPastmonth=="Missing" & !(p6h76 %in% c("-6 Skip", "0 None", "Missing"))~"Yes",
                                               childteen=='C' & SmkAtVisitPastmonth=="Missing" & m5g17=='1 yes'~"Yes", 
                                               childteen=='C' & SmkAtVisitPastmonth=="Missing" & m5g17!='1 yes'~"Missing", 
                                               childteen=='T'& SmkAtVisitPastmonth=="Missing" & p6h76 %in% c("-6 Skip", "0 None", "Missing")~"Missing"))

myFF %>% filter(childteen=='C') %>% xtabs(~SmkAtVisitPastmonth+SmkAtVisitYesNo, data=.)
##                       SmkAtVisitYesNo
## SmkAtVisitPastmonth    Missing  No Yes
##   Less than pack a day       0   0 193
##   Missing                   13   0   0
##   No smoking                 0 544   0
##   Pack or more a day         0   0  78
myFF %>% filter(childteen=='T') %>% xtabs(~SmkAtVisitPastmonth+SmkAtVisitYesNo, data=.)
##                       SmkAtVisitYesNo
## SmkAtVisitPastmonth    Missing  No Yes
##   Less than pack a day       0   0 184
##   Missing                    3   0   2
##   No smoking                 0 628   0
##   Pack or more a day         0   0  39
myFF$SmkAtVisitPastmonth<-factor(myFF$SmkAtVisitPastmonth)

4.5.3 Child smoking

Exclude children who reported smoking at age 9 and age 15. Interestingly, the children who report smoking at age 9 do not report smoking at age 15, and the parental perception of child smoking at age 9 does not match the children who say they have ever smoked a cigarette. For now using the ‘ever smoked’ variable from age 15 for the exclusion, but there is also a ‘smoked in past month’ that we could use instead (lose fewer children).

with(myFF, table(k5f1l,childteen)) # 6 kids smoking at age 9 
##          childteen
## k5f1l       C   T
##   1 yes     6   6
##   2 no    816 835
##   Missing   7  15
with(myFF, table(p5q3cr, k5f1l)) # not the same kids whose parents say somewhat or very true that the child is using tobacco
##                               k5f1l
## p5q3cr                         1 yes 2 no Missing
##   1 not true                      12 1608      14
##   2 somewhat or sometimes true     0    4       0
##   3 very true or often true        0    2       0
##   Missing                          0   37       8
with(myFF, table(k6d40, childteen)) # 41 kids say ever smoked a cigarette at age 15
##          childteen
## k6d40       C   T
##   1 Yes    41  40
##   2 No    779 809
##   Missing   9   7
#with(myFF, summary(k6d41))
with(myFF, table(k6d42, childteen)) # of these only 13 say they've smoked in the past month 
##                         childteen
## k6d42                      C   T
##   -6 Skip                779 809
##   1 Never                 28  28
##   2 Once or twice a week  10  10
##   3 3 to 5 days a week     0   0
##   4 6 to 7 days a week     3   2
##   Missing                  9   7
#with(myFF, summary(k6d43))
with(myFF, table(k5f1l, k6d40, childteen))# not the same kids at age 9 and 15
## , , childteen = C
## 
##          k6d40
## k5f1l     1 Yes 2 No Missing
##   1 yes       1    5       0
##   2 no       40  769       7
##   Missing     0    5       2
## 
## , , childteen = T
## 
##          k6d40
## k5f1l     1 Yes 2 No Missing
##   1 yes       1    5       0
##   2 no       39  791       5
##   Missing     0   13       2
varLabels4<-c(varLabels3, "Child's age at maternal/PCG (9/15 yr) interview (months)", "Child's age at home vist (9/15 years) (months)", "Constructed child's age at home visit (in months) (9/15 years)", "Mother/PCG (9/15 yr) smoking in month prior to interview (pk/day)", "Mother/PCG (9/15 yr) smoking in month prior to interview (yes/no)")
myFF<-set_label(myFF, varLabels4)

5 Analysis

5.1 Randomization of prenatal maternal smoking across slide numbers

slide_count<-myFF %>% mutate(slide=as.factor(slide))%>% group_by(smkPreg_binary, slide, .drop=F)%>%count()%>%pivot_wider(id_cols = slide, names_from=smkPreg_binary, values_from=n)%>%arrange(Yes)%>%mutate(Proportion=Yes/(No+Yes))
hist_data<-slide_count %>% ggplot(aes(x=Proportion))+geom_histogram(binwidth = 0.1, fill='blue', alpha=0.5, breaks=seq(0, 1, 0.1))+ggtitle('Histogram proprotion smk (from data)')
hist_data_c<-slide_count %>% ggplot(aes(x=Yes))+geom_histogram( fill='blue', alpha=0.5, breaks=c(0:12))+scale_x_continuous(breaks=c(0:12))+ggtitle('Histogram # Yes smk (from data)')

kable(slide_count)%>%kable_styling()%>%scroll_box(width='100%', height='400px')
slide No Yes Proportion
200347970068 12 0 0.0000000
200394970127 12 0 0.0000000
200397540068 12 0 0.0000000
200490750005 9 0 0.0000000
200490750009 11 0 0.0000000
200490750011 11 0 0.0000000
200490750063 8 0 0.0000000
200490750108 12 0 0.0000000
200490750124 12 0 0.0000000
200556930100 11 0 0.0000000
200556930111 10 0 0.0000000
200770460118 9 0 0.0000000
200770460172 12 0 0.0000000
200770460173 10 0 0.0000000
200863810009 12 0 0.0000000
200863810070 12 0 0.0000000
200863810114 10 0 0.0000000
200866140081 9 0 0.0000000
200866140163 12 0 0.0000000
201502620070 11 0 0.0000000
201502620071 11 0 0.0000000
201561870093 4 0 0.0000000
201561870123 12 0 0.0000000
3999984076 12 0 0.0000000
3999984094 12 0 0.0000000
3999984111 12 0 0.0000000
3999997019 12 0 0.0000000
9979858077 12 0 0.0000000
9979858078 12 0 0.0000000
9979858102 10 0 0.0000000
9979858118 12 0 0.0000000
9979858128 12 0 0.0000000
9986360014 12 0 0.0000000
9986360015 12 0 0.0000000
200397860021 10 1 0.0909091
200397860035 10 1 0.0909091
200490750003 10 1 0.0909091
200863810112 7 1 0.1250000
201502620017 10 1 0.0909091
201561870100 7 1 0.1250000
3999932043 9 1 0.1000000
9829118090 9 1 0.1000000
9829118105 9 1 0.1000000
9986360019 11 1 0.0833333
200347970051 10 2 0.1666667
200347970066 10 2 0.1666667
200347970071 10 2 0.1666667
200394970128 10 2 0.1666667
200397540074 10 2 0.1666667
200397860024 10 2 0.1666667
200397860046 7 2 0.2222222
200397860066 9 2 0.1818182
200397870011 8 2 0.2000000
200400320071 10 2 0.1666667
200490750001 10 2 0.1666667
200490750006 7 2 0.2222222
200490750021 10 2 0.1666667
200490750053 9 2 0.1818182
200490750074 8 2 0.2000000
200490750079 5 2 0.2857143
200490750081 10 2 0.1666667
200490750106 7 2 0.2222222
200490750107 10 2 0.1666667
200490750148 10 2 0.1666667
200556930110 10 2 0.1666667
200556930121 9 2 0.1818182
200556930130 9 2 0.1818182
200770460063 7 2 0.2222222
200863810019 10 2 0.1666667
200863810029 10 2 0.1666667
200863810053 9 2 0.1818182
200866140164 10 2 0.1666667
200866140165 10 2 0.1666667
200866140214 10 2 0.1666667
201502620002 9 2 0.1818182
201502620003 9 2 0.1818182
201502620010 7 2 0.2222222
201502620024 8 2 0.2000000
201502620039 8 2 0.2000000
201502620073 10 2 0.1666667
201502620074 9 2 0.1818182
201502620078 10 2 0.1666667
201561870001 10 2 0.1666667
201561870002 10 2 0.1666667
201561870060 8 2 0.2000000
201561870068 8 2 0.2000000
201561870083 10 2 0.1666667
201561870090 10 2 0.1666667
201561870091 10 2 0.1666667
201561870092 10 2 0.1666667
201561870101 8 2 0.2000000
201561870126 6 2 0.2500000
201561870136 9 2 0.1818182
3999932048 9 2 0.1818182
3999932050 10 2 0.1666667
3999932051 10 2 0.1666667
3999984051 10 2 0.1666667
3999984077 10 2 0.1666667
3999984080 7 2 0.2222222
3999984088 9 2 0.1818182
3999984090 10 2 0.1666667
3999984108 10 2 0.1666667
9829118115 5 2 0.2857143
9829118137 9 2 0.1818182
9979858076 10 2 0.1666667
9979858105 10 2 0.1666667
9986360012 10 2 0.1666667
9986360033 9 2 0.1818182
200556930073 9 3 0.2500000
200556930095 9 3 0.2500000
200770460062 8 3 0.2727273
200863730056 9 3 0.2500000
201502620016 7 3 0.3000000
3999932006 5 3 0.3750000
3999932091 9 3 0.2500000
3999932092 6 3 0.3333333
200347970040 8 4 0.3333333
200347970050 6 4 0.4000000
200347970069 8 4 0.3333333
200397540092 8 4 0.3333333
200397860096 6 4 0.4000000
200490750077 8 4 0.3333333
200490750082 8 4 0.3333333
200490750090 8 4 0.3333333
200490750091 3 4 0.5714286
200490750093 8 4 0.3333333
200556930109 6 4 0.4000000
200863730043 8 4 0.3333333
200863730057 8 4 0.3333333
200866140028 6 4 0.4000000
200866140079 8 4 0.3333333
200866140238 8 4 0.3333333
3999984091 8 4 0.3333333
3999984112 8 4 0.3333333
3999997140 8 4 0.3333333
9979858131 8 4 0.3333333
9986360027 8 4 0.3333333
200490750095 7 5 0.4166667
201561870059 7 5 0.4166667
3999932036 6 5 0.4545455
200347970036 6 6 0.5000000
200397540069 6 6 0.5000000
200397860094 6 6 0.5000000
200490750080 2 6 0.7500000
200866140159 6 6 0.5000000
3999984067 6 6 0.5000000
9986360025 6 6 0.5000000
9986360046 6 6 0.5000000
200397540040 4 8 0.6666667
201561870003 4 8 0.6666667
3999984100 4 8 0.6666667
3999984114 4 8 0.6666667
#simulation
set.seed(seed=20210216)
n<-nrow(myFF)
yes<-table(myFF$smkPreg_binary)[2]; no<-table(myFF$smkPreg_binary)[1]
values<-data.frame('smk'=sample(c('No', 'Yes'), n, replace=T, prob=c(no/(no+yes), yes/(no+yes))))
values_slides<-values %>% 
    group_by((row_number()-1) %/% (n()/152)) %>%
    nest()%>%unnest(cols=c(data))
colnames(values_slides)<-c('slide', 'smk')
slide_sim<-values_slides%>%mutate(slide=as.factor(slide)) %>% group_by(smk, slide, .drop=F)%>%count()%>%pivot_wider(id_cols = slide, names_from=smk, values_from=n)%>%arrange(Yes)%>%mutate(Proportion=Yes/(No+Yes))
hist_sim<-slide_sim%>% ggplot(aes(x=Proportion))+geom_histogram(binwidth = 0.1, fill='red', alpha=0.5, breaks=seq(0, 1, 0.1))+ggtitle('Histogram proprotion smk (from simulation)')
hist_sim_c<-slide_sim%>% ggplot(aes(x=Yes))+geom_histogram(fill='red', alpha=0.5, breaks=c(0:12))+scale_x_continuous(breaks=c(0:12))+ggtitle('Histogram # Yes smk (from simulation)')


hist_overlay<-slide_count %>% ggplot(aes(x=Proportion))+geom_histogram(fill='blue', binwidth = 0.1, alpha=0.5, breaks=seq(0, 1, 0.1))+geom_histogram(aes(x=slide_sim$Proportion), breaks=seq(0, 1, 0.1), fill='red', binwidth = 0.1, alpha=0.5)+xlim(0,1)
hist_overlay_c<-slide_count %>% ggplot(aes(x=Yes))+geom_histogram(fill='blue', alpha=0.5, breaks=c(0:12))+geom_histogram(aes(x=slide_sim$Yes), fill='red',  alpha=0.5, breaks=c(0:12))+scale_x_continuous(breaks=c(0:12))



#plot_grid(plot_grid(hist_data, hist_sim, nrow=2), hist_overlay, nrow=1)
plot_grid(plot_grid(hist_data_c, hist_sim_c, nrow=2), hist_overlay_c, nrow=1)

5.2 Association of prenatal maternal smoking with plate

This was added 05/05/21 following 04/17/21 email from Jonah with batch variables from Princeton (thanks Jonah!)

Platechisq<-chisq.test(table(myFF$Sample_Plate, myFF$smkPreg_binary))
kable(table(myFF$Sample_Plate, myFF$smkPreg_binary))
No Yes
1 32 6
2 73 22
3 86 8
4 74 21
5 69 24
6 64 21
7 70 24
8 72 24
9 66 20
10 58 16
11 63 24
12 82 12
13 73 16
14 71 16
15 73 18
16 76 13
17 69 14
18 76 18
19 67 17
23 30 7

Prenatal maternal smoking is not significantly associated with plate/batch ($^2=$24.65; P-value=0.17 )

5.3 Flow chart of sample exclusions

methylCohort<-pdqc_all %>% filter(childteen!='M')

basemodelvar<-c("smkPreg_binary", "ancestry", "cm1bsex", "cm1inpov", "Epi", "IC", "ChildAgeComposite")
basemodeldata<-myFF %>% filter_at(all_of(basemodelvar), all_vars(!(. %in%c("Missing", "Missing PC data"))& !is.na(.)))
child_base<-basemodeldata%>%filter(childteen=='C')
teen_base<-basemodeldata%>%filter(childteen=='T')

prenatalexposurevar<-c(basemodelvar, "m1g2_YesNoPreg", "m1g3_YesNoPreg")
prenatalexdata<-basemodeldata %>%filter_at(all_of(prenatalexposurevar), all_vars(!(. %in% c("Missing"))))
child_prenatal<-prenatalexdata%>%filter(childteen=='C')
teen_prenatal<-prenatalexdata%>%filter(childteen=='T')

secondhandsmokevars<-c(prenatalexposurevar, "PostnatalMaternalSmokingAny", "SmkAtVisitPastmonth")
secondhandsmkdata<-prenatalexdata%>%filter_at(all_of(secondhandsmokevars), all_vars(!(. %in% c("Missing"))))
child_secondhand<-secondhandsmkdata %>% filter(childteen=='C')
teen_secondhand<-secondhandsmkdata%>%filter(childteen=='T')
twovisit_secondhand<-secondhandsmkdata %>% filter(idnum %in% child_secondhand$idnum & idnum %in% teen_secondhand$idnum)

child_smoke<-secondhandsmkdata %>% filter(k5f1l!= "2 no" & childteen=='C')
child_nosmoke<-secondhandsmkdata %>% filter(k5f1l=="2 no" & childteen=='C')
teen_smoke<-secondhandsmkdata %>% filter(k6d40!="2 No" & childteen=='T')
teen_nosmoke<-secondhandsmkdata %>% filter(k6d40=="2 No" & childteen=='T')  
twovisit_nosmoke<-secondhandsmkdata %>% filter(idnum %in% child_nosmoke$idnum & idnum %in% teen_nosmoke$idnum)

fathersmkdata<-secondhandsmkdata %>% filter(f1g4!="Missing")

#Make Labels
n_fullFF<-paste0("Fragile Families cohort \n", "n=", FF_labeled %>% nrow())
n_methylQC<-paste0("Cohort with 450K methylation data \n", "n=", methylCohort$idnum%>%unique()%>%length(), "individuals with m=", methylCohort$MethID%>%length(), "samples")

n_QCloss<-paste0("Methylation data QC: 12 low-intensity samples dropped, \n", pdqc_all%>%filter(probe_fail_10==T)%>%nrow()-12, " samples with >10% of probes failing dropped, \n", pdqc_all%>%filter(sex!=predicted_sex & probe_fail_10==F & childteen!='M')%>%nrow(), " discordant sex samples dropped, \n & ", nrow(pdqc)-nrow(pdqc_clean), " duplicate samples dropped")
n_methylFF<-paste0("Cohort with quality controlled 450K data \n","n=", myFF %>% select(id) %>% unique() %>% nrow(), " individuals with m=", myFF%>%nrow(), " samples")

n_basemodel<-paste0("Base models: \n", "n=", basemodeldata %>% select(id) %>% unique() %>% nrow(), " individuals with m=", basemodeldata %>%nrow(), " samples")
n_baseexclusions<-paste0("n=",  myFF %>% filter_at(all_of(basemodelvar), any_vars((. %in%c("Missing", "Missing PC data"))))%>% select(id) %>% unique() %>% nrow(), " individuals & m=", myFF %>% filter_at(all_of(basemodelvar), any_vars((. %in%c("Missing", "Missing PC data")))) %>% nrow(), " samples missing principal components of ancestry; \n n=", myFF %>% filter_at(all_of(basemodelvar), any_vars((is.na(.))))%>%select(id)%>%unique()%>%nrow(), " individuals & m=", myFF %>% filter_at(all_of(basemodelvar), any_vars((is.na(.))))%>%nrow(), " samples missing child age at maternal/PCG interview")

n_prenatalexposures<-paste0("Prenatal exposures models: \n","n=", prenatalexdata %>% select(id) %>% unique() %>% nrow(), " individuals with m=", prenatalexdata %>%nrow(), " samples")
n_prenatalexclusions<-paste0("n=", basemodeldata %>% filter_at(all_of(prenatalexposurevar), any_vars(. %in% c("Missing"))) %>% select(id) %>% unique() %>% nrow(), " individuals & m=", basemodeldata %>% filter_at(all_of(prenatalexposurevar), any_vars(. %in% c("Missing"))) %>% nrow(), " samples missing \n data on maternal prenatal alcohol and drug use")

n_secondhandSmk<-paste0("Secondhand smoke exposure models: \n","n=", secondhandsmkdata %>% select(id) %>% unique() %>% nrow(), " individuals with m=", secondhandsmkdata %>%nrow(), " samples")
n_secondhandexclusions<-paste0("n=", prenatalexdata$id[!(prenatalexdata$id %in% secondhandsmkdata$id)] %>% unique()%>% length(), " individuals & m=", prenatalexdata %>% filter_at(all_of(secondhandsmokevars), any_vars(. %in% c("Missing"))) %>% nrow(), " samples missing data \n on maternal or primary care giver postnatal smoking")

n_Child<-paste0('n=', child_secondhand  %>% nrow(), " individuals with a 9 year visit")
n_ChildSmkexcluded<-paste0("Exclude children who smoke: \n",'n=', child_nosmoke %>% nrow(), " individuals with a 9 year visit")
n_ChildSmkexclusions<-paste0('n=', child_smoke%>%nrow(), " individual \n smoking at 9 year visit \n or missing focal child smoking data")
n_Teen<-paste0('n=', teen_secondhand  %>% nrow(), " individuals with a 15 year visit")
n_TeenSmkexcluded<-paste0("Exclude children who smoke: \n",'n=', teen_nosmoke %>% nrow(), " individuals with a 15 year visit")
n_TeenSmkexclusions<-paste0('n=', teen_smoke%>%nrow(), " individuals \n smoking at 15 year visit \n or missing focal child smoking data") 

n_sensFatherSmoking<-paste0("Sensitivity analysis: compare effects of maternal and paternal prenatal smoking \n", "n=", fathersmkdata %>% select(id) %>% unique() %>% nrow(), " individuals with m=", fathersmkdata %>%nrow(), " samples", "\n with prenatal paternal smoking data")
n_sensFatherSmokingEx<- paste0("n=", secondhandsmkdata$id[!(secondhandsmkdata$id %in% fathersmkdata$id)] %>% unique()%>% length(), " individuals & m=", secondhandsmkdata %>% filter(f1g4 %in% c("Missing")) %>% nrow(), " samples missing data on paternal prenatal smoking")

n_twovisits<-paste0("n=", twovisit_secondhand%>%pull(idnum) %>%unique() %>%length(), " individuals with both a 9 and 15 year visit")
n_twovisitsSmkExcluded<-paste0("Exclude children who smoke: \n", "n=", twovisit_nosmoke%>%pull(idnum) %>%unique() %>%length(), " individuals with both a 9 and 15 year visit")
n_twovisitSmkExclusions<-paste0("n=", (twovisit_secondhand%>%pull(idnum) %>%unique() %>%length())-(twovisit_nosmoke%>%pull(idnum) %>%unique() %>%length()), " individuals smoking at \n either 9 and 15 year visit or \n missing focal child smoking data")
graph1<-DiagrammeR::grViz("
   digraph graph2 {
   compound=true;
   graph [layout = dot]
   
   node [shape = rectangle, width=4]
   a [label = '@@1']
   b [label = '@@2']
   c [label = '@@3']
   d [label = '@@4']
   e [label = '@@5']
   f [label = '@@6']
   g [label = '@@7']
   h [label = '@@8']
   i [label = '@@9', group=g1]
   j [label = '@@10']
   k [label = '@@16']
   l [label = '@@17']
   m [label = '@@19']
   
   #fake nodes for connecting to edges
   node [shape=point, width=0.01, height=0.01]
   a1
   b1
   c1
   d1
#   e1
  f1 [group=g1]
  g1
  k1
   
  #exclusion nodes
  node [shape = rectangle, width=4]
  aa [label = '@@20']
  bb [label = '@@11']
  cc [label = '@@12']
  dd [label = '@@13']
  ff [label = '@@14', width=3]
  gg [label = '@@15', width=3]
  kk [label = '@@18', width=3]
    
  #draw connections
  a->m
  m->a1 [arrowhead=none]
  a1->b
  b->b1 [arrowhead=none]
  b1->c
  c->c1 [arrowhead=none]
  c1->d
  d->d1 [arrowhead=none]
  d1->e
  e->f; e->g; e->k
{rank=same; h->e [dir=back, minlen=2, style=dashed]}
  f->f1 [arrowhead=none]
  g->g1 [arrowhead=none]
  k->k1 [arrowhead=none]
  f1->i 
  g1->j 
  k1->l 
  
{rank=same; a1->aa [minlen=2]}   
{rank=same; b1->bb [minlen=2]} 
{rank=same; c1->cc [minlen=2]}
{rank=same; d1->dd [minlen=2]}
{rank=same; f1->ff [minlen=2]}
{rank=same; g1->gg [minlen=2]}
{rank=same; k1->kk [minlen=2]}

#invisible constraints to force ordering
ff->i [style=invis]
   }   
   
   [1]: n_fullFF
   [2]: n_methylFF
   [3]: n_basemodel
   [4]: n_prenatalexposures
   [5]: n_secondhandSmk
   [6]: n_Child
   [7]: n_Teen
   [8]: n_sensFatherSmoking
   [9]: n_ChildSmkexcluded
   [10]: n_TeenSmkexcluded
   [11]: n_baseexclusions
   [12]: n_prenatalexclusions
   [13]: n_secondhandexclusions
   [14]: n_ChildSmkexclusions
   [15]: n_TeenSmkexclusions
   [16]: n_twovisits
   [17]: n_twovisitsSmkExcluded
   [18]: n_twovisitSmkExclusions
   [19]: n_methylQC
   [20]: n_QCloss
")
graph1

5.4 Comparison of analytic sample: whole sample of Fragile Families vs methylation data vs analytic subset

analysis_data<-rbind(child_nosmoke, teen_nosmoke)
methyl1<-createTable(compareGroups(MethylData~smkPreg_binary+cm1inpov+cm1ethrace+cm1bsex+m1g2_YesNoPreg+m1g3_YesNoPreg+PostnatalMaternalSmokingAny+k5f1l+k6d40+f1g4, data=FF_labeled), show.all=T)
export2md(methyl1, caption="Exclusion table of all Fragile Families vs those with quality controlled methylation data for main predictors and key demographic variables from first wave")
Exclusion table of all Fragile Families vs those with quality controlled methylation data for main predictors and key demographic variables from first wave
[ALL] Analysis subset Not in analysis subset p.overall
N=4898 N=880 N=4018
Any maternal smoking during pregnancy: 0.230
Missing 12 (0.24%) 0 (0.00%) 12 (0.30%)
No 3934 (80.3%) 702 (79.8%) 3232 (80.4%)
Yes 952 (19.4%) 178 (20.2%) 774 (19.3%)
cm1inpov Constructed - Poverty ratio - mother’s household income/poverty threshold 2.22 (2.41) 2.23 (2.43) 2.22 (2.41) 0.981
cm1ethrace Mother race (baseline own report): .
1 white, non-hispanic 1030 (21.0%) 174 (19.8%) 856 (21.3%)
2 black, non-hispanic 2326 (47.5%) 485 (55.1%) 1841 (45.8%)
3 hispanic 1336 (27.3%) 188 (21.4%) 1148 (28.6%)
4 other 194 (3.96%) 31 (3.52%) 163 (4.06%)
Missing 12 (0.24%) 2 (0.23%) 10 (0.25%)
cm1bsex Constructed - Focal baby’s gender: 0.348
1 Boy 2568 (52.4%) 444 (50.5%) 2124 (52.9%)
2 Girl 2329 (47.6%) 436 (49.5%) 1893 (47.1%)
Missing 1 (0.02%) 0 (0.00%) 1 (0.02%)
Any alcohol during pregnancy: 0.635
Missing 13 (0.27%) 2 (0.23%) 11 (0.27%)
No 4364 (89.1%) 777 (88.3%) 3587 (89.3%)
Yes 521 (10.6%) 101 (11.5%) 420 (10.5%)
Any drugs during pregnancy: 0.478
Missing 11 (0.22%) 3 (0.34%) 8 (0.20%)
No 4616 (94.2%) 833 (94.7%) 3783 (94.2%)
Yes 271 (5.53%) 44 (5.00%) 227 (5.65%)
Any postnatal maternal smoking (age 1 & 5): <0.001
Maternal smoking at age 1 or age 5 1545 (31.5%) 333 (37.8%) 1212 (30.2%)
Missing 827 (16.9%) 47 (5.34%) 780 (19.4%)
No maternal smoking at age 1 and age 5 2526 (51.6%) 500 (56.8%) 2026 (50.4%)
k5f1l F1L. Smoked a cigarette or used tobacco: <0.001
1 yes 26 (0.53%) 6 (0.68%) 20 (0.50%)
2 no 3313 (67.6%) 859 (97.6%) 2454 (61.1%)
Missing 1559 (31.8%) 15 (1.70%) 1544 (38.4%)
k6d40 D40. Ever smoked an entire cigarette?: <0.001
1 Yes 185 (3.78%) 41 (4.66%) 144 (3.58%)
2 No 3244 (66.2%) 829 (94.2%) 2415 (60.1%)
Missing 1469 (30.0%) 10 (1.14%) 1459 (36.3%)
f1g4 In the past 3 months, how many cigarettes did you smoke a day?: 0.029
1 2+pk/d 62 (1.27%) 11 (1.25%) 51 (1.27%)
2 1<pk<2 364 (7.43%) 73 (8.30%) 291 (7.24%)
3 <1pk/d 1067 (21.8%) 204 (23.2%) 863 (21.5%)
4 None 2327 (47.5%) 434 (49.3%) 1893 (47.1%)
Missing 1078 (22.0%) 158 (18.0%) 920 (22.9%)
myFF$MethylData2<-ifelse(myFF$MethID %in% c(child_nosmoke$MethID, teen_nosmoke$MethID), 'Final analysis group', 'Not in final analysis group')
set_label(myFF$MethylData2)<-'Exclusions based on missing data'
methyl2<-createTable(compareGroups(MethylData2~smkPreg_binary+globalmethylation+pediatric+Epi+IC+anynewborn_center+SSnewbornCT_center+SSolder_center+cm1inpov+ancestry+cm1bsex+m1g2_YesNoPreg+m1g3_YesNoPreg+PostnatalMaternalSmokingAny+SmkAtVisitPastmonth, data=myFF), show.all=T)
export2md(strataTable(methyl2, 'childteen'), caption="Exclusion table of subset with quality controlled methylation data vs final analytic subset for main predictor, outcomes and key demographic variables")
Exclusion table of subset with quality controlled methylation data vs final analytic subset for main predictor, outcomes and key demographic variables

C
T
[ALL] Final analysis group Not in final analysis group p.overall [ALL] Final analysis group Not in final analysis group p.overall
N=829 N=723 N=106 N=856 N=720 N=136
Any maternal smoking during pregnancy: 0.608 0.062
No 661 (79.7%) 574 (79.4%) 87 (82.1%) 683 (79.8%) 583 (81.0%) 100 (73.5%)
Yes 168 (20.3%) 149 (20.6%) 19 (17.9%) 173 (20.2%) 137 (19.0%) 36 (26.5%)
Global methylation 0.48 (0.01) 0.48 (0.01) 0.48 (0.01) 0.458 0.48 (0.01) 0.48 (0.01) 0.48 (0.01) 0.092
pediatric methylation clock 8.99 (0.92) 8.99 (0.94) 8.99 (0.78) 0.973 11.6 (1.73) 11.6 (1.75) 11.9 (1.61) 0.080
Epithelial cell proportion 0.27 (0.13) 0.27 (0.13) 0.29 (0.16) 0.353 0.31 (0.16) 0.30 (0.16) 0.33 (0.15) 0.037
Immune cell proportion 0.71 (0.13) 0.71 (0.12) 0.70 (0.15) 0.408 0.68 (0.15) 0.69 (0.15) 0.66 (0.15) 0.036
PES: Any smoking, newborn cordblood - mean center -0.02 (0.22) -0.02 (0.21) -0.02 (0.23) 0.921 0.02 (0.23) 0.02 (0.23) 0.04 (0.23) 0.316
PES: Sustained smoking, newborn cordblood (+CT control) - mean center -0.01 (0.23) -0.01 (0.23) -0.03 (0.26) 0.497 0.02 (0.23) 0.02 (0.23) 0.01 (0.23) 0.576
PES: Sustained smoking, older children peripheral blood - mean center 0.00 (0.25) 0.01 (0.24) -0.02 (0.30) 0.297 0.00 (0.28) 0.01 (0.28) -0.04 (0.28) 0.051
cm1inpov Constructed - Poverty ratio - mother’s household income/poverty threshold 2.25 (2.45) 2.21 (2.41) 2.49 (2.69) 0.318 2.23 (2.44) 2.21 (2.43) 2.32 (2.45) 0.644
Ancestry from PCA of child’s genetic data: <0.001 <0.001
African ancestry 484 (58.4%) 452 (62.5%) 32 (30.2%) 496 (57.9%) 449 (62.4%) 47 (34.6%)
European ancestry 124 (15.0%) 115 (15.9%) 9 (8.49%) 132 (15.4%) 119 (16.5%) 13 (9.56%)
Hispanic ancestry 180 (21.7%) 156 (21.6%) 24 (22.6%) 182 (21.3%) 152 (21.1%) 30 (22.1%)
Missing PC data 41 (4.95%) 0 (0.00%) 41 (38.7%) 46 (5.37%) 0 (0.00%) 46 (33.8%)
cm1bsex Constructed - Focal baby’s gender: 0.119 0.003
1 Boy 417 (50.3%) 356 (49.2%) 61 (57.5%) 427 (49.9%) 343 (47.6%) 84 (61.8%)
2 Girl 412 (49.7%) 367 (50.8%) 45 (42.5%) 429 (50.1%) 377 (52.4%) 52 (38.2%)
Missing 0 (0.00%) 0 (0.00%) 0 (0.00%) 0 (0.00%) 0 (0.00%) 0 (0.00%)
Any alcohol during pregnancy: 0.002 0.043
Missing 2 (0.24%) 0 (0.00%) 2 (1.89%) 2 (0.23%) 0 (0.00%) 2 (1.47%)
No 731 (88.2%) 633 (87.6%) 98 (92.5%) 757 (88.4%) 638 (88.6%) 119 (87.5%)
Yes 96 (11.6%) 90 (12.4%) 6 (5.66%) 97 (11.3%) 82 (11.4%) 15 (11.0%)
Any drugs during pregnancy: 0.003 0.007
Missing 3 (0.36%) 0 (0.00%) 3 (2.83%) 3 (0.35%) 0 (0.00%) 3 (2.21%)
No 785 (94.7%) 687 (95.0%) 98 (92.5%) 810 (94.6%) 683 (94.9%) 127 (93.4%)
Yes 41 (4.95%) 36 (4.98%) 5 (4.72%) 43 (5.02%) 37 (5.14%) 6 (4.41%)
Any postnatal maternal smoking (age 1 & 5): <0.001 <0.001
Maternal smoking at age 1 or age 5 315 (38.0%) 288 (39.8%) 27 (25.5%) 323 (37.7%) 281 (39.0%) 42 (30.9%)
Missing 41 (4.95%) 0 (0.00%) 41 (38.7%) 47 (5.49%) 0 (0.00%) 47 (34.6%)
No maternal smoking at age 1 and age 5 473 (57.1%) 435 (60.2%) 38 (35.8%) 486 (56.8%) 439 (61.0%) 47 (34.6%)
Mother/PCG (9/15 yr) smoking in month prior to interview (pk/day): <0.001 <0.001
Less than pack a day 193 (23.3%) 173 (23.9%) 20 (19.0%) 184 (21.5%) 157 (21.8%) 27 (19.9%)
Missing 13 (1.57%) 0 (0.00%) 13 (12.4%) 5 (0.58%) 0 (0.00%) 5 (3.68%)
No smoking 544 (65.7%) 481 (66.5%) 63 (60.0%) 628 (73.4%) 533 (74.0%) 95 (69.9%)
Pack or more a day 78 (9.42%) 69 (9.54%) 9 (8.57%) 39 (4.56%) 30 (4.17%) 9 (6.62%)

6 Datasets for modeling

Variables

Base model: Binary smoking variable [X], ancestry categorization, baby sex [X], maternal income to poverty ratio [X], child’s age in month at visit of interest (note, currently using a constructed child’s age at visit, where individuals having a missing child’s age at home visit was supplemented with child’s age at home interview) [X], cell types (Epi + IC, may need to take one out),

Prenatal exposures: Base model variables + prenatal maternal alcohol use (Yes/no) [X], prenatal maternal drug use (yes/no)

Secondhand smoke exposure models: Prenatal exposures model+ any maternal smoking at age 1 & 5 [X], maternal or PCG smoking dose at visit of interest (age 9 or 15) [X]

First hand smoking: Secondhand smoke exposure models + exclude children with missing or reported firsthand smoking at visit of interest (age 9 or 15)

Paternal smoking: Filter to individuals with data on prenatal paternal smoking and compare effect size of prenatal paternal to prenatal maternal (as per Wiklund et al. 2019, idea is that this captures additional confounding)

Ancestry specific models - by ancestry groups; replace ancestry control with control for local PCs within ancestry group

Longitduinal models - no additional filtering needed

save(child_base, file=paste0(datadir, 'CreatedData/', 'ChildBaseModelData.Rdata'))
save(teen_base, file=paste0(datadir, 'CreatedData/', 'ChildBaseModelData.Rdata'))
save(basemodeldata, file=paste0(datadir, 'CreatedData/', 'BaseModelData.Rdata'))

save(child_prenatal, file=paste0(datadir, 'CreatedData/', 'ChildPrenatalExModelData.Rdata'))
save(teen_prenatal, file=paste0(datadir, 'CreatedData/', 'PrenatalExModelData.Rdata'))
save(prenatalexdata, file=paste0(datadir, 'CreatedData/', 'PrenatalExModelData.Rdata'))

save(child_secondhand, file=paste0(datadir, 'CreatedData/', 'ChildSecondHandSmkModelData.Rdata'))
save(teen_secondhand, file=paste0(datadir, 'CreatedData/', 'TeenSecondHandSmkModelData.Rdata'))
save(twovisit_secondhand, file=paste0(datadir, 'CreatedData/', 'TwoVisitSecondHandSmkModelData.Rdata'))
save(secondhandsmkdata, file=paste0(datadir, 'CreatedData/', 'SecondHandSmkModelData.Rdata'))

save(child_nosmoke, file=paste0(datadir, 'CreatedData/', 'ChildNoSmkModelData.Rdata'))
save(teen_nosmoke, file=paste0(datadir, 'CreatedData/', 'TeenNoSmkModelData.Rdata'))
save(twovisit_nosmoke, file=paste0(datadir, 'CreatedData/', 'TwoVisitNoSmkModelData.Rdata'))


save(fathersmkdata, file=paste0(datadir, 'CreatedData/', 'FatherSmkModelData.Rdata'))


save(list=c("basemodeldata", "prenatalexdata", "secondhandsmkdata", "child_secondhand", "teen_secondhand", "twovisit_secondhand", "twovisit_nosmoke", "child_nosmoke", "teen_nosmoke", "fathersmkdata"), file=paste0(datadir, 'CreatedData/', Sys.Date(), 'CreatedData.Rdata'))